To address the inadequate extraction of spatial features and the low prediction accuracy of conventional deep-learning-based video prediction, a video prediction model Combining Involution and Convolution Operators (CICO) was proposed. The model improved video prediction performance in three ways. Firstly, convolutions with different kernel sizes were adopted to strengthen the extraction of multi-granularity spatial features and to enable representation learning of targets from multiple perspectives: larger kernels extracted features over broader spatial ranges, while smaller kernels captured motion details more precisely. Secondly, the large-kernel convolutions were replaced by involution operators, which are computationally efficient and have fewer parameters, in order to achieve efficient inter-channel interaction, avoid redundant parameters, and reduce computational and storage costs while also enhancing the predictive capacity of the model. Finally, 1×1 convolutions were introduced for linear mapping to strengthen the joint expression of distinct features, improve parameter utilization efficiency, and increase prediction robustness. The superiority of the proposed model was validated through comprehensive experiments on multiple datasets, yielding improvements over the state-of-the-art SimVP (Simpler yet Better Video Prediction) model. On the Moving MNIST dataset, the Mean Squared Error (MSE) and Mean Absolute Error (MAE) were reduced by 25.2% and 17.4%, respectively. On the Traffic Beijing dataset, the MSE was reduced by 1.2%. On the KTH dataset, the Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) were improved by 0.66% and 0.47%, respectively. These results indicate that the proposed model effectively improves the accuracy of video prediction.
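The parameter saving claimed for replacing large-kernel convolutions with involution can be illustrated with a rough weight count. The sketch below assumes the commonly described involution design, in which a per-pixel kernel is generated by two 1×1 mappings (C to C/r, then C/r to K·K·G) and shared across the channels of each group. The function names, the channel count, the kernel size, the reduction ratio r, and the group count G are all hypothetical choices for illustration and are not taken from the CICO configuration; biases and normalization layers are ignored.

```python
def conv_params(channels: int, kernel: int) -> int:
    """Weights of a standard K x K convolution with `channels` inputs and outputs (no bias)."""
    return channels * channels * kernel * kernel


def involution_params(channels: int, kernel: int, reduction: int, groups: int) -> int:
    """Weights of an involution kernel generator (no bias, no normalization).

    Assumed structure: two 1x1 mappings, C -> C/r followed by C/r -> K*K*G.
    """
    hidden = channels // reduction
    return channels * hidden + hidden * kernel * kernel * groups


# Illustrative values only; NOT the configuration reported for CICO.
C, K, R, G = 64, 7, 4, 4

print(conv_params(C, K))              # 200704
print(involution_params(C, K, R, G))  # 4160
```

Under these assumed settings the convolution weight count grows with C²·K², whereas the involution generator grows with C²/r + K²·G·C/r, which is why the gap widens as the kernel gets larger.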